Intro: Regression models are the workhorse of data science. They are the best described, most practical, and most theoretically understood models in statistics. A data scientist well versed in regression models will be able to solve an incredible array of problems.
Perhaps the key insight for regression models is that they produce highly interpretable model fits. This is unlike machine learning algorithms, which often sacrifice interpretability for improved prediction performance or automation. These are, of course, valuable attributes in their own right. However, the benefits of simplicity, parsimony and interpretability offered by regression models (and their close generalizations) make them a first tool of choice for any practical problem.
the galton data comes from the UsingR package
# load necessary packages (install if needed)
library(UsingR); data(galton)
library(ggplot2)
library(reshape)
long <- melt(galton)
g <- ggplot(long, aes(x = value, fill = variable))
g + geom_histogram(colour = "black", binwidth = 1) + facet_grid(. ~ variable)
use the manipulate function to see what value of \(\mu\) minimizes the sum of the squared deviations
# function to plot the histograms
myHist <- function(mu){
    # calculate the mean squared error
    mse <- mean((galton$child - mu)^2)
    # plot histogram
    g <- ggplot(galton, aes(x = child)) + geom_histogram(fill = "salmon",
        colour = "black", binwidth = 1)
    # add vertical line marking the center value mu
    g <- g + geom_vline(xintercept = mu, size = 2)
    g <- g + ggtitle(paste("mu = ", mu, ", MSE = ", round(mse, 2), sep = ""))
    g
}
# manipulate allows the user to change the variable mu to see how the mean squares changes
# library(manipulate)
# manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
# mu that minimizes the sum is the empirical mean
myHist(round(mean(galton$child), 4))
in order to visualize the parent-child height relationship, a scatter plot can be used
Note: because there are multiple data points for the same parent/child combination, a third dimension (size of point) should be used when constructing the scatter plot; represents number of points at that \((X, Y)\) combination
library(dplyr)
# constructs table for different combination of parent-child height
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
# convert to numeric values
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
# filter to only meaningful combinations
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(1, 10), guide = "none" )
# plot grey circles slightly larger than data as base (achieve an outline effect)
g <- g + geom_point(colour="grey50", aes(size = freq+10))
# plot the accurate data points
g <- g + geom_point(aes(colour=freq, size = freq))
# change the colour gradient from the default to lightblue -> white
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g
# centering data
y <- galton$child - mean(galton$child)
x <- galton$parent - mean(galton$parent)
freqData <- as.data.frame(table(x, y))
# table(x, y) puts x (centered parent) in the first dimension
names(freqData) <- c("parent", "child", "freq")
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
myPlot <- function(beta){
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(1, 10), guide = "none" )
g <- g + geom_point(colour="grey50", aes(size = freq+10))
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g <- g + geom_abline(intercept = 0, slope = beta, size = 3)
mse <- mean( (y - beta * x) ^2 )
g <- g + ggtitle(paste("beta = ", beta, "mse = ", round(mse, 3)))
g
}
# manipulate allows the user to change the slope to see how the mean squares changes
# manipulate(myPlot(beta), beta = slider(0.6, 1.2, step = 0.02))
# slope that minimizes the sum is calculated as follows
beta <- coef(lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton))[1]
myPlot(round(beta, 4))
Let \(Y = \beta X\), and \(\hat \beta\) = estimate of \(\beta\), the slope of the least squares regression line \[ \begin{aligned} \sum_{i=1}^n (Y_i - X_i \beta)^2 & = \sum_{i=1}^n \left[ (Y_i - X_i \hat \beta) + (X_i \hat \beta - X_i \beta) \right]^2 \Leftarrow \mbox{added } \pm X_i \hat \beta \mbox{ is effectively adding zero}\\ (expanding~the~terms)& = \sum_{i=1}^n (Y_i - X_i \hat \beta)^2 + 2 \sum_{i=1}^n (Y_i - X_i \hat \beta)(X_i \hat \beta - X_i \beta) + \sum_{i=1}^n (X_i \hat \beta - X_i \beta)^2 \\ \sum_{i=1}^n (Y_i - X_i \beta)^2 & \geq \sum_{i=1}^n (Y_i - X_i \hat \beta)^2 + 2 \sum_{i=1}^n (Y_i - X_i \hat \beta)(X_i \hat \beta - X_i \beta) \Leftarrow \sum_{i=1}^n (X_i \hat \beta - X_i \beta)^2 \mbox{ is always non-negative}\\ & (ignoring~the~second~term~for~now,~for~\hat \beta ~to~be~the~minimizer~of~the~squares, \\ & the~following~must~be~true)\\ \sum_{i=1}^n (Y_i - X_i \beta)^2 & \geq \sum_{i=1}^n (Y_i - X_i \hat \beta)^2 \Leftarrow \hat \beta \mbox{ would have to be the minimizer, because every other } \beta \mbox{ value }\\ & \mbox{creates a least square criteria that is at least as large or larger}\\ & \Rightarrow 2 \sum_{i=1}^n (Y_i - X_i \hat \beta)(X_i \hat \beta - X_i \beta) = 0 \Leftarrow \mbox{if we can make this term zero, then we've found our estimate}\\ (simplifying)& \Rightarrow \sum_{i=1}^n (Y_i - X_i \hat \beta) X_i (\hat \beta - \beta) = 0 \Leftarrow (\hat \beta - \beta) \mbox{ does not depend on }i\\ (simplifying)& \Rightarrow \sum_{i=1}^n (Y_i - X_i \hat \beta) X_i = 0 \\ (solving~for~\hat \beta) & \Rightarrow \hat \beta = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2}\\ \end{aligned} \]
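The closed form above can be checked numerically. A minimal sketch with simulated data (the variable names and seed here are illustrative, not from the notes): the manual estimate should agree with lm fit through the origin.

```r
# verify that beta-hat = sum(Y_i X_i) / sum(X_i^2) equals the
# slope from regression through the origin (the "- 1" formula)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.5)
# manual closed-form estimate
betaHat <- sum(y * x) / sum(x ^ 2)
# compare against lm's estimate
all.equal(betaHat, unname(coef(lm(y ~ x - 1))))
```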
Intro: Ordinary least squares (OLS) is the workhorse of statistics. It gives a way of taking complicated outcomes and explaining behavior (such as trends) using linearity. The simplest application of OLS is fitting a line through some data. In the next few lectures, we cover the basics of linear least squares.
the best fitting line for predictor, \(X\), and outcome, \(Y\), is derived from the least squares criterion \[\sum_{i=1}^n \{Y_i - (\beta_0 + \beta_1 X_i)\}^2\]
coef(lm(y ~ x)) = R command to run the least squares regression model on the data with y as the outcome, and x as the regressor
coef() = returns the slope and intercept coefficients of the lm results
# outcome
y <- galton$child
# regressor
x <- galton$parent
# slope
beta1 <- cor(y, x) * sd(y) / sd(x)
# intercept
beta0 <- mean(y) - beta1 * mean(x)
# results are the same as using the lm command
results <- rbind("manual" = c(beta0, beta1), "lm(y ~ x)" = coef(lm(y ~ x)))
# set column names
colnames(results) <- c("intercept", "slope")
# print results
results
##           intercept     slope
## manual 23.94153 0.6462906
## lm(y ~ x) 23.94153 0.6462906
lm(y ~ x - 1) = forces a regression line to go through the origin (0, 0)
### Regression through the origin yields an equivalent slope if you center the data first
# centering y
yc <- y - mean(y)
# centering x
xc <- x - mean(x)
# slope
beta1 <- sum(yc * xc) / sum(xc ^ 2)
# results are the same as using the lm command
results <- rbind("centered data (manual)" = beta1, "lm(y ~ x)" = coef(lm(y ~ x))[2],
"lm(yc ~ xc - 1)" = coef(lm(yc ~ xc - 1))[1])
# set column names
colnames(results) <- c("slope")
# print results
results
##                            slope
## centered data (manual) 0.6462906
## lm(y ~ x) 0.6462906
## lm(yc ~ xc - 1) 0.6462906
### Normalizing variables results in the slope being the correlation
# normalize y
yn <- (y - mean(y))/sd(y)
# normalize x
xn <- (x - mean(x))/sd(x)
# compare correlations
results <- rbind("cor(y, x)" = cor(y, x), "cor(yn, xn)" = cor(yn, xn),
"slope" = coef(lm(yn ~ xn))[2])
# print results
results
##                    xn
## cor(y, x) 0.4587624
## cor(yn, xn) 0.4587624
## slope 0.4587624
geom_smooth(method = "lm", formula = y ~ x) function in ggplot2 = adds regression line and confidence interval to graph
formula = y ~ x = default for the line (argument can be eliminated if y ~ x produces the line you want)
# constructs table for different combinations of parent-child height
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
# convert to numeric values
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(1, 10), guide = "none" )
g <- g + geom_point(colour="grey50", aes(size = freq+10))
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g <- g + geom_smooth(method="lm", formula=y~x)
g
# load father.son data
data(father.son)
# normalize son's height
y <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
# normalize father's height
x <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
# calculate correlation
rho <- cor(x, y)
# plot the relationship between the two
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_point(size = 3, alpha = .2, colour = "black")
g = g + geom_point(size = 2, alpha = .2, colour = "salmon")
g = g + xlim(-4, 4) + ylim(-4, 4)
# reference line for perfect correlation between
# variables (data points would lie on the diagonal line)
g = g + geom_abline()
# if there is no correlation between the two variables,
# the data points will lie on horizontal/vertical lines
g = g + geom_vline(xintercept = 0)
g = g + geom_hline(yintercept = 0)
# plot the actual correlation for both
g = g + geom_abline(intercept = 0, slope = rho, size = 2)
g = g + geom_abline(intercept = 0, slope = 1 / rho, size = 2)
# add appropriate labels
g = g + xlab("Father's height, normalized")
g = g + ylab("Son's height, normalized")
g = g + geom_text(x = 3.5, y = 1.5, label="son~father", angle = 25) +
geom_text(x = 3.2, y = 3.6, label="cor(y,x)=1", angle = 35) +
geom_text(x = 1.6, y = 3.7, label="father~son", angle = 60)
it may be useful to shift the intercept at times \[ \begin{aligned} Y_i &= \beta_0 + \beta_1 X_i + \epsilon_i \\ &= \beta_0 + a \beta_1 + \beta_1 (X_i - a) + \epsilon_i \\ &= \tilde \beta_0 + \beta_1 (X_i - a) + \epsilon_i ~ ~ ~ ~ \mbox{where } \tilde \beta_0 = \beta_0 + a \beta_1 \\ \end{aligned} \]
often, \(a\) is set to \(\bar X\) so that the intercept is interpreted as the expected response at the average \(X\) value
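A minimal simulated check of this identity (variable names and coefficients here are made up for illustration): shifting the regressor by \(a\) changes only the intercept, which becomes \(\beta_0 + a \beta_1\), while the slope is unchanged.

```r
# shifting the regressor by a changes the intercept to beta0 + a*beta1
# while leaving the slope unchanged
set.seed(42)
x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
a <- mean(x)
fit  <- coef(lm(y ~ x))
fitC <- coef(lm(y ~ I(x - a)))
# new intercept equals old intercept + a * slope
all.equal(unname(fitC[1]), unname(fit[1] + a * fit[2]))
# slope is unchanged
all.equal(unname(fitC[2]), unname(fit[2]))
```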
diamond dataset from UsingR package
lm(price ~ I(carat - mean(carat)), data=diamond) = mean centered linear regression
arithmetic transformations of predictors must be wrapped in I() to work
predict(fitModel, newdata=data.frame(carat=c(0, 1, 2))) = returns predicted outcome from the given model (linear in our case) at the provided points within the newdata data frame
if newdata is unspecified (argument omitted), then the predict function will return predicted values for all observed values of the predictor (x variable, carat in this case)
newdata has to be a data frame, and the values you would like to predict at (x variable, carat in this case) have to be specified, or the system won't know what to do with the provided values
summary(fitModel) = prints detailed summary of the linear model
library(UsingR)
data(diamond)
# standard linear regression for price vs carat
fit <- lm(price ~ carat, data = diamond)
# intercept and slope
coef(fit)
## (Intercept)       carat
## -259.6259 3721.0249
# mean-centered regression
fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
# intercept and slope
coef(fit2)
##            (Intercept) I(carat - mean(carat))
## 500.0833 3721.0249
# regression with more granular scale (1/10th carat)
fit3 <- lm(price ~ I(carat * 10), data = diamond)
# intercept and slope
coef(fit3)
##   (Intercept) I(carat * 10)
## -259.6259 372.1025
# predictions for 3 values
newx <- c(0.16, 0.27, 0.34)
# manual calculations
coef(fit)[1] + coef(fit)[2] * newx
## [1]  335.7381  745.0508 1005.5225
# prediction using the predict function
predict(fit, newdata = data.frame(carat = newx))
##         1         2         3
## 335.7381 745.0508 1005.5225
# plot the data points
plot(diamond$carat, diamond$price, xlab = "Mass (carats)", ylab = "Price (SIN $)",
bg = "lightblue", col = "black", cex = 1.1, pch = 21,frame = FALSE)
# plot linear regression line
abline(fit, lwd = 2)
# plot predictions for every value of carat (in red)
points(diamond$carat, predict(fit), pch = 19, col = "red")
# add guidelines for predictions for 0.16, 0.27, and 0.34
lines(c(0.16, 0.16, 0.12), c(200, coef(fit)[1] + coef(fit)[2] * 0.16,
coef(fit)[1] + coef(fit)[2] * 0.16))
lines(c(0.27, 0.27, 0.12), c(200, coef(fit)[1] + coef(fit)[2] * 0.27,
coef(fit)[1] + coef(fit)[2] * 0.27))
lines(c(0.34, 0.34, 0.12), c(200, coef(fit)[1] + coef(fit)[2] * 0.34,
coef(fit)[1] + coef(fit)[2] * 0.34))
# add text labels
text(newx + c(0.03, 0, 0), rep(250, 3), labels = newx, pos = 2)
Intro: Residuals represent variation left unexplained by our model. We emphasize the difference between residuals and errors. The errors are the unobservable true deviations from the known coefficients, while residuals are the observable deviations from the estimated coefficients. In a sense, the residuals are estimates of the errors.
mean(fitModel$residuals) = returns mean of residuals \(\rightarrow\) should equal 0
cov(fitModel$residuals, predictors) = returns the covariance of residuals and predictors \(\rightarrow\) should also equal 0
one needs to differentiate residual variation (variation after removing the predictor) from systematic variation (variation explained by the regression model)
deviance(fitModel) = calculates the sum of the squared residuals for the linear model (the residual variation)
summary(fitModel)$sigma = returns the residual standard error of a fit model (the square root of the unbiased variance estimate, using n - 2 degrees of freedom)
summary(fitModel) = creates a list of different parameters of the fit model
# get data
y <- diamond$price; x <- diamond$carat; n <- length(y)
# linear fit
fit <- lm(y ~ x)
# calculate residual standard error through summary and manually
rbind("from summary" = summary(fit)$sigma, "manual" = sqrt(sum(resid(fit)^2) / (n - 2)))
##                  [,1]
## from summary 31.84052
## manual 31.84052
cor(outcome, predictor) = calculates the correlation between predictor and outcome \(\rightarrow\) squaring this value yields \(R^2\)
data(anscombe); example(anscombe) demonstrates the fallacy of relying on \(R^2\) alone
relationship between \(R^2\) and \(r\) \[ \begin{aligned} \mbox{Correlation between X and Y} \Rightarrow r & = Cor(Y, X)\\ R^2 & = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} \\ &recall \Rightarrow (\hat Y_i - \bar Y) = \hat \beta_1 (X_i - \bar X)\\ (substituting (\hat Y_i - \bar Y)) & = \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}\\ &recall \Rightarrow \hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)}\\ (substituting \hat \beta_1) & = Cor(Y, X)^2\frac{Var(Y)}{Var(X)} \times \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}\\ & recall ~ Var(Y) = \sum_{i=1}^n (Y_i - \bar Y)^2 ~ and ~ Var(X) = \sum_{i=1}^n (X_i - \bar X)^2\\ (simplifying) \Rightarrow R^2 &= Cor(Y,X)^2\\ \mbox{Or } R^2 & = r^2\\ \end{aligned} \]
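The identity \(R^2 = r^2\) can be verified numerically. A small sketch with simulated data (names and seed are illustrative):

```r
# check that lm's R^2 equals the squared correlation
set.seed(7)
x <- rnorm(100); y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
# the two quantities should agree to machine precision
all.equal(summary(fit)$r.squared, cor(y, x)^2)
```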
total variation derivation \[ \begin{aligned} \mbox{First, we know that } \bar Y & = \hat \beta_0 + \hat \beta_1 \bar X \\ (transforming) \Rightarrow \hat \beta_0 & = \bar Y - \hat \beta_1 \bar X \\ \\ \mbox{We also know that } \hat Y_i & = \hat \beta_0 + \hat \beta_1 X_i \\ \\ \mbox{Next, the residual } = (Y_i - \hat Y_i) & = Y_i - \hat \beta_0 - \hat \beta_1 X_i \\ (substituting~\hat \beta_0) & = Y_i - (\bar Y - \hat \beta_1 \bar X) - \hat \beta_1 X_i \\ (transforming) \Rightarrow (Y_i - \hat Y_i) & = (Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X) \\ \\ \mbox{Next, the regression difference } = (\hat Y_i - \bar Y) & = \hat \beta_0 + \hat \beta_1 X_i - \bar Y \\ (substituting~\hat \beta_0) & = \bar Y - \hat \beta_1 \bar X + \hat \beta_1 X_i - \bar Y \\ (transforming) \Rightarrow (\hat Y_i - \bar Y) & = \hat \beta_1 (X_i - \bar X) \\ \\ \mbox{Total Variation} = \sum_{i=1}^n (Y_i - \bar Y)^2 & = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 \Leftarrow (adding~\pm \hat Y_i) \\ (expanding) & = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + 2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\ \\ \mbox{Looking at } \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) & \\ (substituting~(Y_i - \hat Y_i),(\hat Y_i - \bar Y)) &= \sum_{i=1}^n \left[(Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X)\right]\left[\hat \beta_1 (X_i - \bar X)\right]\\ (expanding) & = \hat \beta_1 \sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X) -\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2\\ & (recall~that~\hat \beta_1 = \frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}) \\ & \Rightarrow \sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X) = \hat \beta_1 \sum_{i=1}^n (X_i - \bar X)^2 \\ (substituting) & = \hat \beta_1^2 \sum_{i=1}^n (X_i - \bar X)^2-\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2\\ \Rightarrow \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) &= 0 \\ \\ \mbox{Going back to } \sum_{i=1}^n (Y_i - \bar Y)^2 &= \sum_{i=1}^n (Y_i - \hat Y_i)^2 + 2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\ (since~the~second~term= 0) \Rightarrow \sum_{i=1}^n (Y_i - \bar Y)^2 &= \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2\\ \end{aligned} \]
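The decomposition of total variation into residual plus regression variation can also be checked numerically. A small sketch with simulated data (names and seed are illustrative):

```r
# check that total SS = residual SS + regression SS
set.seed(8)
x <- rnorm(100); y <- 2 + x + rnorm(100)
fit <- lm(y ~ x)
totalSS      <- sum((y - mean(y))^2)
residualSS   <- sum(resid(fit)^2)
regressionSS <- sum((fitted(fit) - mean(y))^2)
# the decomposition should hold to machine precision
all.equal(totalSS, residualSS + regressionSS)
```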
resid(fitModel) or fitModel$residuals = extracts model residuals from the fit model (lm in our case) \(\rightarrow\) list of residual values for every value of X
summary(fitModel)$r.squared = returns the \(R^2\) value of the regression model
# load multiplot function
source("./scripts/multiplot.R")
# get data
y <- diamond$price; x <- diamond$carat; n <- length(y)
# linear regression
fit <- lm(y ~ x)
# calculate residual
e <- resid(fit)
# calculate predicted values
yhat <- predict(fit)
# show that residuals are the differences between y and yhat
max(abs(e - (y - yhat)))
## [1] 9.485746e-13
# show that sum of residuals and
# sum of residuals times regressor equals 0
rbind("sum(e)" = sum(e), "sum(e*x)" = sum(e*x))
##                  [,1]
## sum(e) -1.865175e-14
## sum(e*x) 6.959711e-15
# create 1 x 2 panel plot
par(mfrow=c(1, 2))
# plot residuals on regression line
plot(x, y, xlab = "Mass (carats)", ylab = "Price (SIN $)", bg = "lightblue",
col = "black", cex = 2, pch = 21,frame = FALSE,main = "Residual on Regression Line")
# draw linear regression line
abline(fit, lwd = 2)
# draw red lines from data points to regression line
for (i in 1 : n){lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red" , lwd = 2)}
# plot residual vs x
plot(x, e, xlab = "Mass (carats)", ylab = "Residuals (SIN $)", bg = "lightblue",
col = "black", cex = 2, pch = 21,frame = FALSE,main = "Residual vs X")
# draw horizontal line
abline(h = 0, lwd = 2)
# draw red lines from residual to x axis
for (i in 1 : n){lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)}
# create sine wave pattern
x <- runif(100, -3, 3); y <- x + sin(x) + rnorm(100, sd = .2);
# plot data + regression
g <- ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) +
geom_smooth(method = "lm", colour = "black") +
geom_point(size = 3, colour = "black", alpha = 0.4) +
geom_point(size = 2, colour = "red", alpha = 0.4)+
ggtitle("Residual on Regression Line")
# plot residuals
f <- ggplot(data.frame(x = x, y = resid(lm(y ~ x))), aes(x = x, y = y))+
geom_hline(yintercept = 0, size = 2)+
geom_point(size = 3, colour = "black", alpha = 0.4)+
geom_point(size = 2, colour = "red", alpha = 0.4)+
xlab("X") + ylab("Residual")+ ggtitle("Residual vs X")
multiplot(g, f, cols = 2)
# create heteroskedastic data
x <- runif(100, 0, 6); y <- x + rnorm(100, mean = 0, sd = .001 * x)
# plot data + regression
g <- ggplot(data.frame(x = x, y = y), aes(x = x, y = y))+
geom_smooth(method = "lm", colour = "black")+
geom_point(size = 3, colour = "black", alpha = 0.4)+
geom_point(size = 2, colour = "red", alpha = 0.4) +
ggtitle("Residual on Regression Line")
# plot residuals
f <- ggplot(data.frame(x = x, y = resid(lm(y ~ x))), aes(x = x, y = y))+
geom_hline(yintercept = 0, size = 2) +
geom_point(size = 3, colour = "black", alpha = 0.4)+
geom_point(size = 2, colour = "red", alpha = 0.4)+
xlab("X") + ylab("Residual") + ggtitle("Residual vs X")
# combine two plots
multiplot(g, f, cols = 2)
# plot residuals using ggplot
diamond$e <- resid(lm(price ~ carat, data = diamond))
g = ggplot(diamond, aes(x = carat, y = e))
g = g + xlab("Mass (carats)")
g = g + ylab("Residual price (SIN $)")
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 3, colour = "black", alpha=0.5)
g = g + geom_point(size = 2, colour = "blue", alpha=0.2)
# Create 2 residual vectors:
# FIRST residual vector is just a fit with an intercept
# So the residuals are just the deviations around the average price
# SECOND vector is the residuals around the regression line
# with carats as the explanatory variable
e = c(resid(lm(price ~ 1, data = diamond)),
resid(lm(price ~ carat, data = diamond)))
fit = factor(c(rep("Itc", nrow(diamond)),
rep("Itc, slope", nrow(diamond))))
f = ggplot(data.frame(e = e, fit = fit), aes(y = e, x = fit, fill = fit))
f = f + geom_dotplot(binaxis = "y", size = 2, stackdir = "center", binwidth = 20)
f = f + xlab("Fitting approach")
f = f + ylab("Residual price")
multiplot(g, f, cols = 2)
Intro: Inference is the process of drawing conclusions about a population using a sample. In statistical inference, we must account for the uncertainty in our estimates in a principled way. Hypothesis tests and confidence intervals are among the most common forms of statistical inference.
These statements apply generally, and, of course, to the regression setting as well. In the next few lectures, we’ll cover inference in regression where we make some Gaussian assumptions about the errors.
standard errors for coefficients \[\begin{aligned} Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\ (expanding) & = Var\left(\frac{\sum_{i=1}^n Y_i (X_i - \bar X) - \bar Y \sum_{i=1}^n (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\ & Since~ \sum_{i=1}^n (X_i - \bar X) = 0 \\ (simplifying) & = \frac{Var \left(\sum_{i=1}^n Y_i (X_i - \bar X) \right)}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \Leftarrow \mbox{denominator taken out of } Var\\ (Var(Y_i) = \sigma^2) & = \frac{\sigma^2 \sum_{i=1}^n (X_i - \bar X)^2}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \\ \sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) &= \frac{\sigma^2 }{ \sum_{i=1}^n (X_i - \bar X)^2 }\\ \Rightarrow \sigma_{\hat \beta_1} &= \frac{\sigma}{\sqrt{\sum_{i=1}^n (X_i - \bar X)^2}} \\ \\ \mbox{by the same derivation} \Rightarrow & \\ \sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) & = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2 \\ \Rightarrow \sigma_{\hat \beta_0} &= \sigma \sqrt{\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }}\\ \end{aligned}\]
summary(fitModel)$coefficients = returns a table summarizing the estimate, standard error, t value and p value for the coefficients \(\beta_0\) and \(\beta_1\)
Example
# getting data
y <- diamond$price; x <- diamond$carat; n <- length(y)
# calculate beta1
beta1 <- cor(y, x) * sd(y) / sd(x)
# calculate beta0
beta0 <- mean(y) - beta1 * mean(x)
# residuals (estimates of the Gaussian errors)
e <- y - beta0 - beta1 * x
# residual standard error (square root of the unbiased variance estimate)
sigma <- sqrt(sum(e^2) / (n-2))
# sum(X_i - X Bar)^2
ssx <- sum((x - mean(x))^2)
# calculate standard errors
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
seBeta1 <- sigma / sqrt(ssx)
# statistics for beta0 and beta1; testing for H0: beta0 = 0 and beta1 = 0
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1
# calculating p-values for Ha: beta0 != 0 and beta0 != 0 (two sided)
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
# store results into table
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")
# print table
coefTable
##               Estimate Std. Error   t value      P(>|t|)
## (Intercept) -259.6259 17.31886 -14.99094 2.523271e-19
## x 3721.0249 81.78588 45.49715 6.751260e-40
# regression model and the generated table from lm (identical to above)
fit <- lm(y ~ x)
summary(fit)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -259.6259 17.31886 -14.99094 2.523271e-19
## x 3721.0249 81.78588 45.49715 6.751260e-40
# store results in matrix
sumCoef <- summary(fit)$coefficients
# print out confidence interval for beta0
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
## [1] -294.4870 -224.7649
# print out confidence interval for beta1 in 1/10 units
(sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]) / 10
## [1] 355.6398 388.5651
# using built-in function
confint(fit)
##                2.5 %    97.5 %
## (Intercept) -294.487 -224.7649
## x 3556.398 3885.6513
predict(fitModel, newdata, interval = "confidence") = returns a 3-column matrix with data for fit (regression line), lwr (lower bound of interval), and upr (upper bound of interval)
interval = "confidence" = returns the interval for the regression line
interval = "prediction" = returns the interval for a predicted new observation
newdata = must be a data frame with the values you would like to predict at
(ggplot2)
# create a sequence of values that we want to predict at
newx = data.frame(x = seq(min(x), max(x), length = 100))
# calculate values for both intervals
p1 = data.frame(predict(fit, newdata = newx, interval = ("confidence")))
p2 = data.frame(predict(fit, newdata = newx, interval = ("prediction")))
# add column for interval labels
p1$interval = "confidence"; p2$interval = "prediction"
# add column for the x values we want to predict
p1$x = newx$x; p2$x = newx$x
# combine the two dataframes
dat = rbind(p1, p2)
# change the name of the first column to y
names(dat)[1] = "y"
# plot the data
g <- ggplot(dat, aes(x = x, y = y))
g <- g + geom_ribbon(aes(ymin = lwr, ymax = upr, fill = interval), alpha = 0.2)
g <- g + geom_line()
g + geom_point(data = data.frame(x = x, y = y), aes(x = x, y = y), size = 4)
(base)
# plot the x and y values
plot(x,y,frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black",bg="lightblue",cex=2)
# add the fit line
abline(fit,lwd=2)
# create sequence of x values that we want to predict at
xVals<-seq(min(x),max(x),by=.01)
# calculate the predicted y values
yVals<-beta0+beta1*xVals
# calculate the standard errors for the interval for the line
se1<-sigma*sqrt(1/n+(xVals-mean(x))^2/ssx)
# calculate the standard errors for the interval for the predicted values
se2<-sigma*sqrt(1+1/n+(xVals-mean(x))^2/ssx)
# plot the upper and lower bounds of both intervals
lines(xVals,yVals+2*se1, col = "red"); lines(xVals,yVals-2*se1, col = "red")
lines(xVals,yVals+2*se2, col = "blue"); lines(xVals,yVals-2*se2, col = "blue")
Intro: We now extend linear regression so that our models can contain more variables. A natural first approach is to assume additive effects, basically extending our line to a plane, or a generalized version of a plane as we add more variables. Multivariable regression represents one of the most widely used and successful methods in statistics.
performing multivariable regression = pick any regressor and replace the outcome and all other regressors by their residuals against the chosen one
we can hold \(\hat \beta_1\) fixed and solve (2) for \(\hat \beta_2\) \[ \begin{aligned} \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1) X_{2i} - \sum_{i=1}^n X_{2i}^2 \hat \beta_2& = 0\\ \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1) X_{2i} &= \sum_{i=1}^n X_{2i}^2 \hat \beta_2 \\ \hat \beta_2 & = \frac{\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1) X_{2i}}{\sum_{i=1}^n X_{2i}^2} \\ \end{aligned} \]
we can rewrite \[\sum_{i=1}^n \left[\left(Y_i - \frac{\sum_{j=1}^n Y_jX_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\right) - \hat \beta_1 \left(X_{1i} - \frac{\sum_{j=1}^n X_{1j} X_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\right)\right] X_{1i} = 0 ~ ~ ~ ~ ~ ~(3)\] as \[ \sum_{i=1}^n \left[ e_{i, Y|X_2} -\hat \beta_1 e_{i, X_1|X_2} \right] X_{1i}= 0\] where \[e_{i, a|b} = a_i - \frac{ \sum_{j=1}^n a_j b_j}{ \sum_{j=1}^n b_j^2}b_i\] which is interpreted as the residual when regressing \(b\) from \(a\) without an intercept
We get the solution \[\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}X_{1i}} ~ ~ ~ ~ ~ ~(4)\]
But note that \[ \begin{aligned} \sum_{i=1}^n e_{i, X_1 | X_2}^2 & = \sum_{i=1}^n e_{i, X_1 | X_2}\left(X_{1i} - \frac{\sum_{j=1}^n X_{1j}X_{2j} }{ \sum_{j=1}^n X_{2j}^2 } X_{2i}\right) \\ & = \sum_{i=1}^n e_{i, X_1 | X_2}X_{1i} - \frac{\sum_{j=1}^n X_{1j}X_{2j} }{ \sum_{j=1}^n X_{2j}^2 } \sum_{i=1}^n e_{i, X_1 | X_2} X_{2i}\\ & (recall~that~ \sum_{i=1}^n e_{i}X_i = 0,~so~the~2^{nd}~term~is~0) \\ \Rightarrow \sum_{i=1}^n e_{i, X_1 | X_2}^2 & = \sum_{i=1}^n e_{i, X_1 | X_2}X_{1i}\\ \end{aligned} \]
plugging the above equation back into (4), we get \[\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}\]
this is interpreted as the effect of variable \(X_1\) when the effects of all other variables have been removed from \(X_1\) and the predicted result \(Y\) (holding everything else constant/adjusting for all other variables)
# simulate the data
n = 100; x = rnorm(n); x2 = rnorm(n); x3 = rnorm(n)
# equation = intercept + var1 + var2 + var3 + error
y = 1 + x + x2 + x3 + rnorm(n, sd = .1)
# residual of y regressed on var2 and var3
ey = resid(lm(y ~ x2 + x3))
# residual of x regressed on var2 and var3
ex = resid(lm(x ~ x2 + x3))
# estimate beta1 for var1
sum(ey * ex) / sum(ex ^ 2)
## [1] 0.9958997
# regression through the origin with var1 after var2 and var3 effects removed
coef(lm(ey ~ ex - 1))
## ex
## 0.9958997
# regression for all three variables
coef(lm(y ~ x + x2 + x3))
## (Intercept)           x          x2          x3
## 0.9926959 0.9958997 0.9818908 0.9904702
?swiss = documentation for the swiss dataset
[GGally package] ggpairs(data) = produces a pairwise plot for the predictors, similar to pairs in the base package
# load dataset
require(datasets); data(swiss); require(GGally)
# produce pairwise plot using ggplot2
ggpairs(swiss, lower = list(continuous = wrap("smooth", method = "lm")))
# print coefficients of regression of fertility on all predictors
summary(lm(Fertility ~ . , data = swiss))$coefficients
##                    Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07
## Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02
## Examination -0.2580082 0.25387820 -1.016268 3.154617e-01
## Education -0.8709401 0.18302860 -4.758492 2.430605e-05
## Catholic 0.1041153 0.03525785 2.952969 5.190079e-03
## Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03
# run marginal regression on Agriculture
summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 60.3043752 4.25125562 14.185074 3.216304e-18
## Agriculture 0.1942017 0.07671176 2.531577 1.491720e-02
# simulate data
n <- 100; x2 <- 1 : n; x1 <- .01 * x2 + runif(n, -.1, .1); y = -x1 + x2 + rnorm(n, sd = .01)
# print coefficients
c("with x1" = summary(lm(y ~ x1))$coef[2,1],
"with x1 and x2" = summary(lm(y ~ x1 + x2))$coef[2,1])
##        with x1 with x1 and x2
## 94.2393492 -0.9952328
# print p-values
c("with x1" = summary(lm(y ~ x1))$coef[2,4],
"with x1 and x2" = summary(lm(y ~ x1 + x2))$coef[2,4])
##        with x1 with x1 and x2
## 2.296891e-70 5.009692e-76
# store all data in one data frame (ey and ex1 are residuals with respect to x2)
dat <- data.frame(y = y, x1 = x1, x2 = x2, ey = resid(lm(y ~ x2)), ex1 = resid(lm(x1 ~ x2)))
# plot y vs x1
g <- ggplot(dat, aes(y = y, x = x1, colour = x2)) +
geom_point(colour="grey50", size = 2) +
geom_smooth(method = lm, se = FALSE, colour = "black") + geom_point(size = 1.5) +
ggtitle("unadjusted = y vs x1")
# plot residual of y adjusted for x2 vs residual of x1 adjusted for x2
g2 <- ggplot(dat, aes(y = ey, x = ex1, colour = x2)) +
geom_point(colour="grey50", size = 2) +
geom_smooth(method = lm, se = FALSE, colour = "black") + geom_point(size = 1.5) +
ggtitle("adjusted = y, x1 residuals with x2 removed") + labs(x = "resid(x1~x2)",
y = "resid(y~x2)")
# combine plots
multiplot(g, g2, cols = 2)
The relationship between y and x1 flips sign when adjusting for x2. This effectively means that within each consecutive group/subset of points (each colour gradient) on the left-hand plot (unadjusted), there is a negative relationship between the points, even though the overall trend goes up.
Note: it is difficult to determine from the data alone which model is correct \(\rightarrow\) one should not claim a positive correlation between Agriculture and Fertility based solely on the marginal regression lm(Fertility ~ Agriculture, data=swiss)
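The agreement between the residual-on-residual regression and the multivariable fit can also be checked numerically: by the Frisch-Waugh-Lovell result, the slope from regressing the y residuals on the x1 residuals (both adjusted for x2) equals the x1 coefficient from the full model. A self-contained sketch, regenerating data as in the simulation above:

```r
# regenerate the simulated data as above
n <- 100; x2 <- 1 : n; x1 <- .01 * x2 + runif(n, -.1, .1); y <- -x1 + x2 + rnorm(n, sd = .01)
# slope from the residual-on-residual regression
resid.slope <- coef(lm(resid(lm(y ~ x2)) ~ resid(lm(x1 ~ x2))))[2]
# x1 coefficient from the full model; the two agree exactly
full.coef <- coef(lm(y ~ x1 + x2))["x1"]
c(resid.slope, full.coef)
```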
Unnecessary (linearly dependent) regressors get NA as their coefficients
# add a linear combination of agriculture and education variables
z <- swiss$Agriculture + swiss$Education
# run linear regression with unnecessary variables
lm(Fertility ~ . + z, data = swiss)$coef
## (Intercept) Agriculture Examination Education
## 66.9151817 -0.1721140 -0.2580082 -0.8709401
## Catholic Infant.Mortality z
## 0.1041153 1.0770481 NA
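Which term was dropped can be confirmed with base R's alias() function, which reports exact linear dependencies among the regressors (a quick self-contained check, using the same z as above):

```r
# alias() reports exact linear dependencies among the regressors;
# here z is constructed as Agriculture + Education, so it is aliased
data(swiss)
z <- swiss$Agriculture + swiss$Education
alias(lm(Fertility ~ . + z, data = swiss))
```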
R handles the collinearity of z by excluding it from the linear regression \(\rightarrow\) z’s coefficient is NA
InsectSprays data set
# load insect spray data
data(InsectSprays)
ggplot(data = InsectSprays, aes(y = count, x = spray, fill = spray)) +
geom_violin(colour = "black", size = 2) + xlab("Type of spray") +
ylab("Insect count")
Linear model fit with group A as reference category
# linear fit with 5 dummy variables
fit <- summary(lm(count ~ spray, data = InsectSprays))$coefficients; fit
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.5000000 1.132156 12.8074279 1.470512e-19
## sprayB 0.8333333 1.601110 0.5204724 6.044761e-01
## sprayC -12.4166667 1.601110 -7.7550382 7.266893e-11
## sprayD -9.5833333 1.601110 -5.9854322 9.816910e-08
## sprayE -11.0000000 1.601110 -6.8702352 2.753922e-09
## sprayF 2.1666667 1.601110 1.3532281 1.805998e-01
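With a reference category, each fitted coefficient is simply that group's mean count minus group A's mean; a quick check with the same data:

```r
# each spray coefficient equals that group's mean count minus group A's mean
data(InsectSprays)
means <- tapply(InsectSprays$count, InsectSprays$spray, mean)
means["B"] - means["A"]  # matches the sprayB estimate, 0.8333
```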
Hard-coding the dummy variables
lm(count ~ spray, data = InsectSprays)
# hard coding dummy variables
summary(lm(count ~ I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
I(1 * (spray == 'F')), data = InsectSprays))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.5000000 1.132156 12.8074279 1.470512e-19
## I(1 * (spray == "B")) 0.8333333 1.601110 0.5204724 6.044761e-01
## I(1 * (spray == "C")) -12.4166667 1.601110 -7.7550382 7.266893e-11
## I(1 * (spray == "D")) -9.5833333 1.601110 -5.9854322 9.816910e-08
## I(1 * (spray == "E")) -11.0000000 1.601110 -6.8702352 2.753922e-09
## I(1 * (spray == "F")) 2.1666667 1.601110 1.3532281 1.805998e-01
Linear model fit with all 6 categories
# linear fit with 6 dummy variables
lm(count ~ I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
I(1 * (spray == 'F')) + I(1 * (spray == 'A')),
data = InsectSprays)$coefficients
## (Intercept) I(1 * (spray == "B")) I(1 * (spray == "C"))
## 14.5000000 0.8333333 -12.4166667
## I(1 * (spray == "D")) I(1 * (spray == "E")) I(1 * (spray == "F"))
## -9.5833333 -11.0000000 2.1666667
## I(1 * (spray == "A"))
## NA
the coefficient for A is NA because its dummy variable is redundant given the intercept and the other five dummies
Linear model fit with omitted intercept
# linear model with omitted intercept
nfit <- summary(lm(count ~ spray - 1, data = InsectSprays))$coefficients; nfit
## Estimate Std. Error t value Pr(>|t|)
## sprayA 14.500000 1.132156 12.807428 1.470512e-19
## sprayB 15.333333 1.132156 13.543487 1.001994e-20
## sprayC 2.083333 1.132156 1.840148 7.024334e-02
## sprayD 4.916667 1.132156 4.342749 4.953047e-05
## sprayE 3.500000 1.132156 3.091448 2.916794e-03
## sprayF 16.666667 1.132156 14.721181 1.573471e-22
# actual means of count by each variable
round(tapply(InsectSprays$count, InsectSprays$spray, mean), 2)
## A B C D E F
## 14.50 15.33 2.08 4.92 3.50 16.67
Reordering the levels
relevel(var, "l") = reorders the factor levels within the factor variable var such that the specified level “l” is the reference/base/lowest level
# reorder the levels of spray variable such that C is the lowest level
spray2 <- relevel(InsectSprays$spray, "C")
# rerun linear regression with releveled factor
fit2 <- summary(lm(count ~ spray2, data = InsectSprays))$coef; fit2
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.083333 1.132156 1.840148 7.024334e-02
## spray2A 12.416667 1.601110 7.755038 7.266893e-11
## spray2B 13.250000 1.601110 8.275511 8.509776e-12
## spray2D 2.833333 1.601110 1.769606 8.141205e-02
## spray2E 1.416667 1.601110 0.884803 3.794750e-01
## spray2F 14.583333 1.601110 9.108266 2.794343e-13
Calculating t-statistics manually
# equivalently: (fit[2,1] - fit[3,1]) / 1.60111
fit2[3,1]/fit2[3,2]
## [1] 8.275511
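The matching p-value can be recovered from the t statistic with pt(); the degrees of freedom are the residual degrees of freedom, 72 observations minus 6 coefficients (a sketch, assuming fit2 from above):

```r
# two-sided p-value from the t statistic (df = 72 - 6 = 66)
tstat <- fit2[3, 1] / fit2[3, 2]
2 * pt(-abs(tstat), df = 66)  # matches the spray2B p-value, 8.51e-12
```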
Numeric = percentage of children aged <5 years who are underweight
Sex = records whether the observation is for males, females, or both sexes
Year = year when data was recorded
Income = income for the child’s parents
# load in hunger data
hunger <- read.csv("./data/hunger.csv")
# exclude the data with "Both Sexes" as values (only want Male vs Female)
hunger <- hunger[hunger$Sex!="Both sexes", ]
# structure of data
str(hunger)
## 'data.frame': 948 obs. of 12 variables:
## $ Indicator : Factor w/ 1 level "Children aged <5 years underweight (%)": 1 1 1 1 1 1 1 1 1 1 ...
## $ Data.Source : Factor w/ 670 levels "NLIS_310005",..: 7 52 519 380 548 551 396 503 643 632 ...
## $ PUBLISH.STATES: Factor w/ 1 level "Published": 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1986 1990 2005 2002 2008 2008 2003 2006 2012 1999 ...
## $ WHO.region : Factor w/ 6 levels "Africa","Americas",..: 1 2 2 3 1 1 1 2 1 2 ...
## $ Country : Factor w/ 151 levels "Afghanistan",..: 115 104 97 69 57 54 71 13 115 144 ...
## $ Sex : Factor w/ 3 levels "Both sexes","Female",..: 3 3 3 2 2 3 3 3 3 2 ...
## $ Display.Value : num 19.3 2.2 5.3 3.2 17 15.7 19.3 4 15.5 4.2 ...
## $ Numeric : num 19.3 2.2 5.3 3.2 17 15.7 19.3 4 15.5 4.2 ...
## $ Low : logi NA NA NA NA NA NA ...
## $ High : logi NA NA NA NA NA NA ...
## $ Comments : logi NA NA NA NA NA NA ...
# run linear model with Numeric vs Year for male and females
male.fit <- lm(Numeric ~ Year, data = hunger[hunger$Sex == "Male", ])
female.fit <- lm(Numeric ~ Year, data = hunger[hunger$Sex == "Female", ])
# plot % hungry vs the year
plot(Numeric ~ Year, data = hunger, pch = 19, col=(Sex=="Male")*1+1)
# plot regression lines for both
abline(male.fit, lwd = 3, col = "red")
abline(female.fit, lwd = 3, col = "black")
abline(intercept, slope) = adds a line to the existing plot based on the intercept and slope provided
abline(lm) = plots the linear regression line on the plot
# run linear model with Numeric vs Year and Sex
both.fit <- lm(Numeric ~ Year+Sex, data = hunger)
# print fit
both.fit$coef
## (Intercept) Year SexMale
## 633.528289 -0.308397 1.902743
# plot % hungry vs the year
plot(Numeric ~ Year, data = hunger, pch = 19, col=(Sex=="Male")*1+1)
# plot regression lines for both (same slope)
abline(both.fit$coef[1], both.fit$coef[2], lwd = 3, col = "black")
abline(both.fit$coef[1]+both.fit$coef[3], both.fit$coef[2], lwd = 3, col = "red")
lm(outcome ~ var1*var2) = whenever an interaction is specified in the lm function using the * operator, the individual terms are added automatically
lm(outcome ~ var1+var2+var1*var2) = builds the exact same model
lm(outcome ~ var1:var2) = builds a linear model with only the interaction term (specified by the : operator)
# run linear model with Numeric vs Year and Sex and interaction term
interaction.fit <- lm(Numeric ~ Year*Sex, data = hunger)
# print fit
interaction.fit$coef
## (Intercept) Year SexMale Year:SexMale
## 603.50579986 -0.29339638 61.94771998 -0.03000132
# plot % hungry vs the year
plot(Numeric ~ Year, data = hunger, pch = 19, col=(Sex=="Male")*1+1)
# plot regression lines for both (different slope)
abline(interaction.fit$coef[1], interaction.fit$coef[2], lwd = 3, col = "black")
abline(interaction.fit$coef[1]+interaction.fit$coef[3],
interaction.fit$coef[2]+interaction.fit$coef[4], lwd = 3, col = "red")
# generate some income data
hunger$Income <- 1:nrow(hunger)*10 + 500*runif(nrow(hunger), 0, 10) +
runif(nrow(hunger), 0, 500)^1.5
# run linear model with Numeric vs Year and Income and interaction term
lm(Numeric ~ Year*Income, data = hunger)$coef
## (Intercept) Year Income Year:Income
## 1.076723e+03 -5.290447e-01 -3.863526e-02 1.927627e-05
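With a continuous interaction, the slope in Year depends on the value of Income: slope(Income) = \(\beta_{Year} + \beta_{Year:Income} \times\) Income. The same coefficient arithmetic can be illustrated with a self-contained example on mtcars (not the hunger data), where a factor interaction gives each group its own slope:

```r
# group-specific slopes from an interaction model: mpg vs weight by transmission
fit <- lm(mpg ~ wt * factor(am), data = mtcars)
cf <- coef(fit)
c(automatic = unname(cf["wt"]),                       # slope when am = 0
  manual = unname(cf["wt"] + cf["wt:factor(am)1"]))   # slope when am = 1
```

For a two-level factor, these recovered slopes match what separate regressions within each group would give.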
Intro: Adjustment is the idea of putting regressors into a linear model to investigate the role of a third variable on the relationship between another two. It is often the case that a third variable can distort, or confound if you will, the relationship between two others.
As an example, consider looking at lung cancer rates and breath mint usage. For the sake of completeness, imagine if you were looking at forced expiratory volume (a measure of lung function) and breath mint usage. If you found a statistically significant regression relationship, it wouldn’t be wise to rush off to the newspapers with the headline “Breath mint usage causes shortness of breath!”, for a variety of reasons. First off, even if the association is sound, you don’t know that it’s causal. But, more importantly in this case, the likely culprit is smoking habits. Smoking rates are likely related to both breath mint usage rates and lung function. How would you defend your finding against the accusation that it’s just variability in smoking habits?
If your finding held up among non-smokers and smokers analyzed separately, then you might have something. In other words, people wouldn’t even begin to believe this finding unless it held up while holding smoking status constant. That is the idea of adding a regression variable into a model as adjustment. The coefficient of interest is interpreted as the effect of the predictor on the response, holding the adjustment variable constant.
In this lecture, we’ll use simulation to investigate how adding a regressor into a model addresses the idea of adjustment.
lm(y ~ x + t)
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), runif(n/2));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- 1; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t
fit <- lm(y ~ x + t)
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)
# print treatment and adjustment effects
rbind("Treatment Effect" = lm(y~t+x)$coef[2], "Adjustment Effect" = lm(y~t)$coef[2])
## t
## Treatment Effect 1.016810
## Adjustment Effect 1.024089
Models considered: lm(y ~ t), lm(y ~ t + x), lm(y ~ x)
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), 1.5 + runif(n/2));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- 0; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t
fit <- lm(y ~ x + t)
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)
# print treatment and adjustment effects
rbind("Treatment Effect" = lm(y~t+x)$coef[2], "Adjustment Effect" = lm(y~t)$coef[2])
## t
## Treatment Effect 0.1825682
## Adjustment Effect 3.1079122
Models considered: lm(y ~ t), lm(y ~ t + x), lm(y ~ x)
here the adjusted treatment effect is near zero, since the true coefficient of \(t\), \(\beta_2\), is 0
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), .9 + runif(n/2));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- -1; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t
fit <- lm(y ~ x + t)
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)
# print treatment and adjustment effects
rbind("Treatment Effect" = lm(y~t+x)$coef[2], "Adjustment Effect" = lm(y~t)$coef[2])
## t
## Treatment Effect -0.9992889
## Adjustment Effect 0.7716072
Models considered: lm(y ~ t), lm(y ~ t + x), lm(y ~ x)
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(.5 + runif(n/2), runif(n/2));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- 1; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t
fit <- lm(y ~ x + t)
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)
# print treatment and adjustment effects
rbind("Treatment Effect" = lm(y~t+x)$coef[2], "Adjustment Effect" = lm(y~t)$coef[2])
## t
## Treatment Effect 0.9199923
## Adjustment Effect 0.1756044
Models considered: lm(y ~ t), lm(y ~ t + x), lm(y ~ x)
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2, -1, 1), runif(n/2, -1, 1));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- 0; beta3 <- -4; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + t * x * beta3 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t and interaction term
fit <- lm(y ~ x + t + I(x * t))
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2] + coef(fit)[4], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)
Models considered: lm(y ~ t), lm(y ~ t + x + t*x), lm(y ~ x)
# simulate data
p <- 1; n <- 100; x2 <- runif(n); x1 <- p * runif(n) - (1 - p) * x2
# define parameters/coefficients
beta0 <- 0; beta1 <- 1; beta2 <- 4 ; sigma <- .01
# generate outcome using linear model
y <- beta0 + x1 * beta1 + beta2 * x2 + rnorm(n, sd = sigma)
# plot y vs x1 and x2
qplot(x1, y) + geom_point(aes(colour=x2)) + geom_smooth(method = lm)
the rgl package can be used to generate 3D plots: rgl::plot3d(x1, x2, y)
# plot the residuals for y and x1 with x2 removed
plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)), frame = FALSE,
col = "black", bg = "lightblue", pch = 21, cex = 1)
# add linear fit line
abline(lm(I(resid(lm(y ~ x2))) ~ I(resid(lm(x1 ~ x2)))), lwd = 2)
given a fitted model fit <- lm(y ~ x), we can use plot(fit) to produce a series of 4 diagnostic plots
# load swiss data
data(swiss)
# run linear regression on Fertility vs all other predictors
fit <- lm(Fertility ~ . , data = swiss)
# generate diagnostic plots in 2 x 2 panels
par(mfrow = c(2, 2)); plot(fit)
# generate data
n <- 100; x <- rnorm(n); y <- x + rnorm(n, sd = .3)
# set up axes
plot(c(-3, 6), c(-3, 6), type = "n", frame = FALSE, xlab = "X", ylab = "Y")
# plot regression line for y vs x
abline(lm(y ~ x), lwd = 2)
# plot actual (x, y) pairs
points(x, y, cex = 1, bg = "lightblue", col = "black", pch = 21)
# plot 4 points of interest
points(0, 0, cex = 1.5, bg = "darkorange", col = "black", pch = 21)
points(0, 5, cex = 1.5, bg = "darkorange", col = "black", pch = 21)
points(5, 5, cex = 1.5, bg = "darkorange", col = "black", pch = 21)
points(5, 0, cex = 1.5, bg = "darkorange", col = "black", pch = 21)
influence measures are implemented in the stats package in R
?influence.measures in R will display the detailed documentation on all available functions to measure influence
the model argument referenced in the following functions is simply the linear fit generated by the lm function (i.e. model <- lm(y ~ x))
rstandard(model) = standardized residuals \(\rightarrow\) residuals divided by their standard deviations
rstudent(model) = Studentized residuals \(\rightarrow\) residuals divided by their standard deviations, where the \(i^{th}\) data point was deleted in the calculation of the standard deviation so that the residual follows a t distribution
hatvalues(model) = measures of leverage; "how far away" a point's predictor values are from the rest of the data
dffits(model) = change in the predicted response when the \(i^{th}\) point is deleted in fitting the model
dfbetas(model) = change in individual coefficients when the \(i^{th}\) point is deleted in fitting the model, standardized by a deleted estimate of the coefficient standard error
dfbeta(model) = difference in coefficients for including vs excluding each data point (unscaled version of dfbetas, i.e. not standardized)
cooks.distance(model) = overall change in coefficients when the \(i^{th}\) point is deleted
resid(model) = returns ordinary residuals
resid(model)/(1-hatvalues(model)) = returns PRESS residuals (i.e. the leave-one-out cross-validation residuals)
The swirl lesson digs deeper into the manual calculations of dfbeta, hatvalues, rstandard, rstudent and cooks.distance
# generate random data and point (10, 10)
x <- c(10, rnorm(n)); y <- c(10, rnorm(n))
# plot y vs x
plot(x, y, frame = FALSE, cex = 1, pch = 21, bg = "lightblue", col = "black")
# perform linear regression
fit <- lm(y ~ x)
# add regression line to plot
abline(fit)
dfbetas(fit) = difference in coefficients for including vs excluding each data point, standardized; returns an n x m data frame, where n = number of observations in the original dataset and m = number of coefficients
# calculate the dfbetas for the slope for the first 10 points
round(dfbetas(fit)[1 : 10, 2], 3)
## 1 2 3 4 5 6 7 8 9 10
## 6.859 0.135 -0.048 -0.049 0.064 -0.015 -0.022 -0.037 0.009 -0.058
as we can see from above, the slope coefficient would change dramatically if the first point (10, 10) is left out
hatvalues(fit) = measures the potential for influence for each point
# calculate the hat values for the first 10 points
round(hatvalues(fit)[1 : 10], 3)
## 1 2 3 4 5 6 7 8 9 10
## 0.549 0.022 0.025 0.017 0.012 0.010 0.012 0.011 0.010 0.012
Manual calculations of Influence Measures
dfbeta(fit) = difference in coefficients for including vs excluding each data point (unscaled)
# exclude outlier in the first element
fitno <- lm(y ~ x, data.frame(y = y[-1], x = x[-1]))
# manually calculate the difference in coefficients
df.diff <- coef(fit)-coef(fitno)
# using dfbeta function to do the same
df.beta <- dfbeta(fit)[1,]
rbind("manually" = df.diff, "dfbeta" = df.beta)
## (Intercept) x
## manually 0.03323954 0.4935411
## dfbeta 0.03323954 0.4935411
hatvalue for the 1st element (outlier)
# calculate residual from model without outlier by plugging outlier into fitted model
resno <- y[1] - predict(fitno, data.frame(x = x[1]))
# 1 minus ratio of residual of fit inclusive outlier (numerator) to residual of fit exclusive outlier (denominator)
hat.manual <- 1-resid(fit)[1]/resno
# using hatvalues function to do the same
hat.values <- hatvalues(fit)[1]
rbind("manually" = hat.manual, "hatvalues" = hat.values)
## 1
## manually 0.5486468
## hatvalues 0.5486468
rstandard(model) = standardized residuals \(\rightarrow\) residuals divided by their standard deviations
# Calculate standardized residuals
# First, calculate the sample standard deviation of fit's residual by dividing fit's deviance
# i.e., its residual sum of squares, by the residual degrees of freedom and taking the square root
sigma <- sqrt(deviance(fit)/df.residual(fit))
# estimate standard deviations of individual samples (=standardized residuals)
rstd <- resid(fit)/(sigma * sqrt(1-hatvalues(fit)))
# using rstandard function to do the same as above
r.standard <- rstandard(fit)
rbind("manually" = rstd[1], "rstandard" = r.standard[1])
## 1
## manually 5.328772
## rstandard 5.328772
rstudent(model) = Studentized residuals \(\rightarrow\) residuals divided by their standard deviations, where the \(i^{th}\) data point was deleted in the calculation of the standard deviation so that the residual follows a t distribution
# Calculate (externally) Studentized residuals (in this example we leave the outlier out)
# Recalling that the model we called fitno omits the outlier sample
# Calculate the sample standard deviation of fitno's residual by dividing its deviance, by its
# residual degrees of freedom and taking the square root
sigma1 <- sqrt(deviance(fitno)/df.residual(fitno))
# Calculate the Studentized residual for the outlier sample
rstdt <- resid(fit)/(sigma1 * sqrt(1-hatvalues(fit)))
# using rstudent function to do the same as above
r.student <- rstudent(fit)
rbind("manually" = rstdt[1], "rstudent" = r.student[1])
## 1
## manually 6.278053
## rstudent 6.278053
cooks.distance(model) = overall change in coefficients when the \(i^{th}\) point is deleted
# Calculate the difference in predicted values between fit and fitno,
# the models which respectively include and omit the outlier
dy <- predict(fitno, data.frame(x = x)) - predict(fit, data.frame(x = x))
# Calculate the outlier's Cook's distance
cook.manual <- sum(dy^2)/(2*sigma^2)
# using cooks.distance function to do the same as above
cook.dist <- cooks.distance(fit)[1]
rbind("manually" = cook.manual, "cook.distance" = cook.dist)
## 1
## manually 17.2584
## cook.distance 17.2584
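The PRESS residual formula resid(fit)/(1 - hatvalues(fit)) from the list above can be verified against an explicit leave-one-out refit; a sketch with fresh simulated data:

```r
# PRESS (leave-one-out) residuals from the hat values
set.seed(1)
n <- 50; x <- rnorm(n); y <- x + rnorm(n)
fit <- lm(y ~ x)
press <- resid(fit) / (1 - hatvalues(fit))
# explicit leave-one-out prediction error for the first point
fit1 <- lm(y ~ x, data = data.frame(x = x[-1], y = y[-1]))
loo <- y[1] - predict(fit1, newdata = data.frame(x = x[1]))
c(press = unname(press[1]), loo = unname(loo))  # identical
```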
# generate data
x <- rnorm(n); y <- x + rnorm(n, sd = .3)
# add an outlier that fits the relationship
x <- c(5, x); y <- c(5, y)
# plot the (x, y) pairs
plot(x, y, frame = FALSE, cex = 1, pch = 21, bg = "lightblue", col = "black")
# perform the linear regression
fit2 <- lm(y ~ x)
# add the regression line to the plot
abline(fit2)
# calculate the dfbetas for the slope for the first 10 points
round(dfbetas(fit2)[1 : 10, 2], 3)
## 1 2 3 4 5 6 7 8 9 10
## 0.079 0.068 0.043 -0.069 -0.021 -0.021 0.030 0.007 -0.006 -0.028
# calculate the hat values for the first 10 points
round(hatvalues(fit2)[1 : 10], 3)
## 1 2 3 4 5 6 7 8 9 10
## 0.175 0.035 0.013 0.015 0.013 0.014 0.020 0.019 0.013 0.025
the added point (5, 5) has a small dfbetas value (indication of low influence) but still has a substantial hatvalue (indication of high leverage)
# read data
data <- read.table('http://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_flat.txt',
header = FALSE)
# construct pairwise plot
pairs(data)
# perform regression of V1 on all other predictors (omitting the intercept)
fit <- lm(V1 ~ . - 1, data = data)
# print the coefficient for linear model
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## V2 0.9856157 0.12798121 7.701253 1.989126e-14
## V3 0.9714707 0.12663829 7.671225 2.500259e-14
## V4 0.8606368 0.11958267 7.197003 8.301184e-13
## V5 0.9266981 0.08328434 11.126919 4.778110e-28
# plot the residuals vs fitted values
plot(predict(fit), resid(fit), pch = '.')
“A model is a lens through which to look at your data” – Scott Zeger
“There’s no such thing as a correct model” – Brian Caffo
“There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.” – Donald Rumsfeld
e.g. if the standard error for the \(\beta_1\) of y~x1+x2 is 1.5 and the standard error for the \(\beta_1\) of y~x1 is 0.5, then the increase in standard error of \(\beta_1\) is 1.5/0.5 - 1 = 200%
# set number of measurements
n <- 100
# set up the axes of the plot
plot(c(1, n), 0 : 1, type = "n", xlab = "p", ylab = expression(R^2),
main = expression(paste(R^2, " vs p")))
# for each value of p from 1 to n
r <- sapply(1 : n, function(p){
# create outcome and p regressors
y <- rnorm(n); x <- matrix(rnorm(n * p), n, p)
# calculate the R^2
summary(lm(y ~ x))$r.squared
})
# plot the R^2 values and connect them with a line
lines(1 : n, r, lwd = 2)
as p approaches n, \(R^2\) approaches 1 even though the regressors are unrelated to y
summary(fit)$cov.unscaled = returns the p x p covariance matrix for the p coefficients, with the diagonal values proportional to the variances of the coefficients
summary(fit)$cov.unscaled[2,2] = variance term for \(\beta_1\)
# generate outcome that is correlated with x1 (x1, x2, x3 from an earlier simulation, not shown)
y <- x1 + rnorm(n, sd = .3)
# store the variance of beta1 for the 1st model
a <- summary(lm(y ~ x1))$cov.unscaled[2,2]
# calculate the ratio of variances of beta1 for 2nd and 3rd models with respect to 1st model
c(summary(lm(y ~ x1 + x2))$cov.unscaled[2,2],
summary(lm(y~ x1 + x2 + x3))$cov.unscaled[2,2]) / a - 1
## [1] 0.9186974 6.7619190
# alternatively, the change in variance can be estimated as a ratio of the variances of the
# simulated estimates (betas = matrix of estimates from an earlier simulation, not shown)
temp <- apply(betas, 1, var); temp[2 : 3] / temp[1] - 1
## x1 x1
## 0.9769326 6.5328874
We will use the swiss data set for this example and compare the following models
# load swiss data
data(swiss)
# run linear regression for Fertility vs Agriculture
fit <- lm(Fertility ~ Agriculture, data = swiss)
# variance for coefficient of Agriculture
a <- summary(fit)$cov.unscaled[2,2]
# run linear regression for Fertility vs Agriculture + Examination
fit2 <- update(fit, Fertility ~ Agriculture + Examination)
# run linear regression for Fertility vs Agriculture + Examination + Education
fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
# calculate ratios of variances for Agriculture coef for fit2 and fit3 w.r.t fit1
c(summary(fit2)$cov.unscaled[2,2], summary(fit3)$cov.unscaled[2,2]) / a - 1
## [1] 0.8915757 1.0891588
[car library] vif(fit) = returns the variance inflation factors for the predictors of the given linear model
# load car library
library(car)
# run linear regression on Fertility vs all other predictors
fit <- lm(Fertility ~ . , data = swiss)
# calculate the variance inflation factors
vif(fit)
## Agriculture Examination Education Catholic
## 2.284129 3.675420 2.774943 1.937160
## Infant.Mortality
## 1.107542
# calculate the standard error inflation factors
sqrt(vif(fit))
## Agriculture Examination Education Catholic
## 1.511334 1.917138 1.665816 1.391819
## Infant.Mortality
## 1.052398
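A VIF can also be computed from first principles as \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the other predictors; checking Agriculture against the vif() output above:

```r
# VIF for Agriculture computed directly from the R^2 of
# Agriculture regressed on the other predictors
data(swiss)
r2 <- summary(lm(Agriculture ~ . - Fertility, data = swiss))$r.squared
1 / (1 - r2)  # matches vif(fit)["Agriculture"], 2.284129
```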
# excluding variable Examination
fit2 <- lm(Fertility ~ . -Examination, swiss)
# calculate the variance inflation factors
vif(fit2)
## Agriculture Education Catholic Infant.Mortality
## 2.147153 1.816361 1.299916 1.107528
Summary VIF
set.seed(8765)
temp <- rnorm(100)
# x1 and x2 are correlated
x1 <- (temp + rnorm(100))/sqrt(2)
x2 <- (temp + rnorm(100))/sqrt(2)
x3 <- rnorm(100)
# Function to simulate regression of y on 2 variables.
f <- function(k){
# generate y (Note: the actual coefficient of x1 is 1)
y <- x1 + x2 + x3 + .3*rnorm(100)
# regression with 2 regressors
c(lm(y ~ x1 + x2)$coef[2], # omit uncorrelated variable
lm(y ~ x1 + x3)$coef[2]) # omit correlated variable
}
# beta1 values from the 2 regressions
x1c <- sapply(1:150, f)
# calculate the means of beta1 values
apply(x1c, 1, mean)
## x1 x1
## 1.034403 1.476944
# plot the histogram of beta1 values
p1 <- hist(x1c[1,], plot=FALSE)
p2 <- hist(x1c[2,], plot=FALSE)
yrange <- c(0, max(p1$counts, p2$counts))
plot(p1, col=rgb(0,0,1,1/4), xlim=range(x1c), ylim=yrange, xlab="Estimated coefficient of x1",
main="Bias Effect of Omitted Regressor")
plot(p2, col=rgb(1,0,0,1/4), xlim=range(x1c), ylim=yrange, add=TRUE)
legend(1.1, 40, c("Uncorrelated regressor, x3, omitted", "Correlated regressor, x2, omitted"),
fill=c(rgb(0,0,1,1/4), rgb(1,0,0,1/4)))
the first row of x1c contains estimates of x1’s coefficient when x3, the regressor uncorrelated with x1, is omitted; omitting the correlated regressor x2 (second row) biases the estimate upward
anova(fit1, fit2, fit3) = produces analysis of variance (or deviance) tables for a series of nested linear regression models
step(lm, k=df) = performs step-wise regression on a given linear model to find and return the best linear model
k=log(n) = specifying the value of k as the log of the number of observations will force the step-wise regression to use the Bayesian Information Criterion (BIC) instead of the AIC
MASS::stepAIC(lm, k = df) = more versatile, rigorous implementation of step-wise regression
We will use the swiss data set for this example and compare the following nested models
# three different regressions that are nested
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
fit5 <- update(fit, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
# perform test using ANOVA
anova(fit1, fit3, fit5)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture
## Model 2: Fertility ~ Agriculture + Examination + Education
## Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 6283.1
## 2 43 3180.9 2 3102.2 30.211 8.638e-09 ***
## 3 41 2105.0 2 1075.9 10.477 0.0002111 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RSS is the residual sum of squares, e.g. for fit3
deviance(fit3)
## [1] 3180.925
Res.Df = residual degrees of freedom for the models
RSS = residual sum of squares for the models, a measure of fit
Df = change in degrees of freedom from one model to the next
Sum of Sq = difference/change in residual sum of squares from one model to the next
F = F statistic, the ratio of two scaled sums of squares divided by their respective degrees of freedom, reflecting different sources of variability \[F = \frac{\frac{RSS_1 - RSS_2}{p_2 - p_1}}{\frac{RSS_2}{n-p_2}}\] where \(p_1\) and \(p_2\) = number of parameters (= number of coefficients + 1 intercept, e.g. for fit3 p = 4, for fit5 p = 6) in the two models under comparison, and \(n\) = number of observations
Pr(>F) = p-value for the F statistic, indicating whether the change in model is significant (obtained by applying the F distribution to the F value)
the Shapiro-Wilk test can be used to check the normality of the residuals
shapiro.test(fit3$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit3$residuals
## W = 0.97276, p-value = 0.336
shapiro.test(fit5$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit5$residuals
## W = 0.98892, p-value = 0.9318
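The F statistic and p-value in the anova table above can be reproduced by hand from the residual sums of squares; a self-contained sketch for the fit3 vs fit5 comparison:

```r
# reproduce the fit3-vs-fit5 row of the anova table
data(swiss)
fit3 <- lm(Fertility ~ Agriculture + Examination + Education, data = swiss)
fit5 <- lm(Fertility ~ ., data = swiss)
rss3 <- deviance(fit3); rss5 <- deviance(fit5)
# 2 extra parameters in fit5; denominator df = n - p2
f <- ((rss3 - rss5) / 2) / (rss5 / df.residual(fit5))
c(F = f, p = pf(f, 2, df.residual(fit5), lower.tail = FALSE))  # F = 10.477, p = 0.00021
```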
We will use the mtcars data set for this example and perform the step-wise regression/model selection algorithm on the following initial model
# load the mtcars data starting regression model
data(mtcars); fit <- lm(mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
# step-wise search using BIC
step(fit, k = log(nrow(mtcars)))
## Start: AIC=73.75
## mpg ~ cyl + disp + hp + drat + wt
##
## Df Sum of Sq RSS AIC
## - drat 1 3.018 170.44 70.854
## - disp 1 6.949 174.38 71.584
## - cyl 1 15.411 182.84 73.100
## <none> 167.43 73.748
## - hp 1 21.066 188.49 74.075
## - wt 1 77.476 244.90 82.453
##
## Step: AIC=70.85
## mpg ~ cyl + disp + hp + wt
##
## Df Sum of Sq RSS AIC
## - disp 1 6.176 176.62 68.528
## - hp 1 18.048 188.49 70.609
## <none> 170.44 70.854
## - cyl 1 24.546 194.99 71.694
## - wt 1 90.925 261.37 81.069
##
## Step: AIC=68.53
## mpg ~ cyl + hp + wt
##
## Df Sum of Sq RSS AIC
## - hp 1 14.551 191.17 67.595
## - cyl 1 18.427 195.05 68.237
## <none> 176.62 68.528
## - wt 1 115.354 291.98 81.147
##
## Step: AIC=67.6
## mpg ~ cyl + wt
##
## Df Sum of Sq RSS AIC
## <none> 191.17 67.595
## - cyl 1 87.15 278.32 76.149
## - wt 1 117.16 308.33 79.426
##
## Call:
## lm(formula = mpg ~ cyl + wt, data = mtcars)
##
## Coefficients:
## (Intercept) cyl wt
## 39.686 -1.508 -3.191
The selected model is mpg ~ cyl + wt.

Intro: Generalized linear models (GLMs) were a great advance in statistical modeling. The original manuscript with the GLM framework was from Nelder and Wedderburn in 1972 in the Journal of the Royal Statistical Society. The McCullagh and Nelder book is the famous standard treatise on the subject.
Recall linear models. Linear models are the most useful applied statistical technique. However, they are not without their limitations. Additive response models don’t make much sense if the response is discrete, or strictly positive. Additive error models often don’t make sense, for example, if the outcome has to be positive. Transformations, such as taking a cube root of a count outcome, are often hard to interpret. In addition, there’s value in modeling the data on the scale that it was collected. Particularly interpretable transformations, natural logarithms in particular, aren’t applicable for negative or zero values.
The generalized linear model is a family of models that includes linear models. By extending the family, it handles many of the issues with linear models, but at the expense of some complexity and loss of some of the mathematical tidiness. A GLM involves three components: (1) An exponential family model for the response, (2) A systematic component via a linear predictor, and (3) A link function that connects the means of the response to the linear predictor.
The three most famous cases of GLMs are: linear models, binomial and binary regression and Poisson regression. We’ll go through the GLM model specification and likelihood for all three. For linear models, we’ve developed them previously. The next two modules will be devoted to binomial and Poisson regression. We’ll only focus on the most popular and useful link functions.
With glm(), it’s possible to specify that the model be solved using quasi-likelihood normal equations instead of the usual normal equations, through the parameters family = quasibinomial and family = quasipoisson respectively.

Intro: Binary GLMs come from trying to model outcomes that can take only two values. Some examples include: survival or not at the end of a study, winning versus losing of a team, and success versus failure of a treatment or product. Often these outcomes are called Bernoulli outcomes, from the Bernoulli distribution named after the famous probabilist and mathematician.
If we happen to have several exchangeable binary outcomes for the same level of covariate values, then that is binomial data and we can aggregate the 0’s and 1’s into the count of 1’s. As an example, imagine if we sprayed insect pests with 4 different pesticides and counted whether they died or not. Then for each spray, we could summarize the data with the count of dead and total number that were sprayed and treat the data as binomial rather than Bernoulli.
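The aggregation described above can be sketched in R. The pesticide data here are simulated purely for illustration:

```r
# hypothetical pesticide data: 1 = insect died, 0 = survived
set.seed(1)
spray <- factor(rep(c("A", "B", "C", "D"), each = 50))
died <- rbinom(200, size = 1, prob = rep(c(.2, .5, .7, .9), each = 50))
# Bernoulli form: one 0/1 outcome per insect
fitBern <- glm(died ~ spray, family = "binomial")
# binomial form: aggregate into counts of dead and total sprayed per spray
agg <- data.frame(spray = levels(spray),
                  dead = tapply(died, spray, sum),
                  sprayed = tapply(died, spray, length))
fitBinom <- glm(cbind(dead, sprayed - dead) ~ spray,
                family = "binomial", data = agg)
# the two fits give identical coefficient estimates
cbind(coef(fitBern), coef(fitBinom))
```

Note the cbind(successes, failures) response form that glm() expects for binomial (as opposed to Bernoulli) data.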
ravenWinNum = 1 for Raven win, 0 for Raven loss
ravenWin = W for Raven win, L for Raven loss
ravenScore = score of the Raven team during the match
opponentScore = score of the opposing team during the match
# load the data
load("./data/ravensData.rda")
ravensData## ravenWinNum ravenWin ravenScore opponentScore
## 1 1 W 24 9
## 2 1 W 38 35
## 3 1 W 28 13
## 4 1 W 34 31
## 5 1 W 44 13
## 6 0 L 23 24
## 7 1 W 31 30
## 8 1 W 23 16
## 9 1 W 9 6
## 10 1 W 31 29
## 11 0 L 13 43
## 12 1 W 25 15
## 13 1 W 55 20
## 14 1 W 13 10
## 15 1 W 16 13
## 16 0 L 20 23
## 17 0 L 28 31
## 18 0 L 17 34
## 19 1 W 33 14
## 20 0 L 17 23
# boxplot of wins and losses
boxplot(ravenScore ~ ravenWin, ravensData, col=0x96, lwd=3, horizontal=TRUE,
col.lab="purple", col.main="purple", xlab="Ravens' Score",
main="Ravens' Wins and Losses vs Score")
abline(v=23, lwd=5, col="purple")
# plot probabilities from the data
plot(c(3,23,29,55), c(0.5, 0.5, 1.0, 1.0), type='l', lwd=5, col="purple", col.lab="purple", ylim=c(0.25,1),
xlab="Ravens' Score", ylab="Probability of a Ravens win", col.main="purple",
main="Crude estimate of Ravens' win probability\nvs the points which they score.")
# perform linear regression
summary(lm(ravenWinNum ~ ravenScore, data = ravensData))##
## Call:
## lm(formula = ravenWinNum ~ ravenScore, data = ravensData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7302 -0.5076 0.1824 0.3215 0.5719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.285032 0.256643 1.111 0.2814
## ravenScore 0.015899 0.009059 1.755 0.0963 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4464 on 18 degrees of freedom
## Multiple R-squared: 0.1461, Adjusted R-squared: 0.09868
## F-statistic: 3.08 on 1 and 18 DF, p-value: 0.09625
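One reason linear regression is a poor fit for binary outcomes is that nothing constrains the fitted values to [0, 1]. A quick check, refitting the model above and predicting for a hypothetical high score:

```r
# linear model of win (0/1) on score, as above
fitLm <- lm(ravenWinNum ~ ravenScore, data = ravensData)
# predicted "probability" of a win for a hypothetical 60-point game
predict(fitLm, data.frame(ravenScore = 60))
# about 1.24 -- an impossible probability; logistic regression avoids this
```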
Using the manipulate function, vary \(\beta_0\) and \(\beta_1\) to fit logistic regression curves to simulated data.
# set x values for the points to be plotted
x <- seq(-10, 10, length = 1000)
# "library(manipulate)" is needed to use the manipulate function
manipulate(
# plot the logistic regression curve
plot(x, exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x)),
type = "l", lwd = 3, frame = FALSE),
# slider for beta1
beta1 = slider(-2, 2, step = .1, initial = 2),
# slider for beta0
beta0 = slider(-2, 2, step = .1, initial = 0)
)
Use glm(outcome ~ predictor, family = "binomial") to fit a logistic regression to the data.
# run logistic regression on data
logRegRavens <- glm(ravenWinNum ~ ravenScore, data = ravensData, family="binomial")
# print summary
summary(logRegRavens)##
## Call:
## glm(formula = ravenWinNum ~ ravenScore, family = "binomial",
## data = ravensData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7575 -1.0999 0.5305 0.8060 1.4947
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.68001 1.55412 -1.081 0.28
## ravenScore 0.10658 0.06674 1.597 0.11
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.435 on 19 degrees of freedom
## Residual deviance: 20.895 on 18 degrees of freedom
## AIC: 24.895
##
## Number of Fisher Scoring iterations: 5
# take e^coefs to find the ratios
exp(logRegRavens$coeff)## (Intercept) ravenScore
## 0.1863724 1.1124694
# take e^log confidence interval to find the confidence intervals
exp(confint(logRegRavens))## 2.5 % 97.5 %
## (Intercept) 0.005674966 3.106384
## ravenScore 0.996229662 1.303304
# plot probabilities from the data
plot(c(3,23,29,55), c(0.5, 0.5, 1.0, 1.0), type='l', lwd=5, col="purple", col.lab="purple", ylim=c(0.25,1),
xlab="Ravens' Score", ylab="Probability of a Ravens win", col.main="purple",
main="Ravens' win vs score probabilities: GLM maximum likelihood estimates\ncompared to crude estimates")
# plot the logistic regression
points(ravensData$ravenScore,logRegRavens$fitted,pch=19,col="blue",xlab="Score",ylab="Prob Ravens Win")
# add legend
legend('bottomright', c("Crude estimates", "GLM maximum likelihood estimates"), lwd=5, lty=1,
col=c("purple", "blue"))

The probabilities estimated by logistic regression using glm() are represented by the blue dots. This fit is more reasonable than our crude estimate in several respects: it increases smoothly with score, it estimates that 15 points give the Ravens a 50% chance of winning, that 28 points give them an 80% chance, and that 55 points make a win very likely (98%) but not absolutely certain.
Since we don’t have data below a score of 9, we can use R’s predict() function to see the model’s estimates for lower scores. Since this gives us log odds, we will have to convert to probabilities.
# calculate log odds
logodds<-predict(logRegRavens, data.frame(ravenScore=c(0,3,6)))
# convert to probabilities
exp(logodds)/(1+exp(logodds))## 1 2 3
## 0.1570943 0.2041977 0.2610505
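The manual conversion above is the inverse logit. R's plogis() computes the same thing, and predict() with type = "response" returns probabilities directly:

```r
# three equivalent ways to get predicted probabilities
newdata <- data.frame(ravenScore = c(0, 3, 6))
logodds <- predict(logRegRavens, newdata)      # log odds (linear predictor)
exp(logodds) / (1 + exp(logodds))              # manual inverse logit
plogis(logodds)                                # built-in inverse logit
predict(logRegRavens, newdata, type = "response")  # probabilities directly
```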
# perform analysis of variance
anova(logRegRavens,test="Chisq")## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: ravenWinNum
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 24.435
## ravenScore 1 3.5398 18 20.895 0.05991 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Df = change in degrees of freedom, here from adding the ravenScore parameter (slope)
Deviance = measure of goodness of model fit compared to the previous model
Resid. Dev = residual deviance for the current model
Pr(>Chi) = used to evaluate the significance of the added parameter

The Deviance value of 3.54 is actually the difference between the deviance of our model, which includes a slope, and that of a model which includes only an intercept.

Intro: Many data take the form of unbounded count data. For example, consider the number of calls to a call center, the number of flu cases in an area, or the number of hits to a web site.
In some cases the counts are clearly bounded. However, modeling the counts as unbounded is often done when the upper limit is not known or very large relative to the number of events.
If the upper bound is known, the techniques we’re discussing can be used to model the proportion or rate. The starting point for most count analysis is the Poisson distribution.
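A defining property of the Poisson distribution is that its mean and variance both equal \(\lambda\). A quick simulation check:

```r
# for Poisson(lambda) counts, mean and variance should both be near lambda
set.seed(3)
counts <- rpois(10000, lambda = 10)
mean(counts)  # close to 10
var(counts)   # also close to 10
```

This mean-variance relationship is the assumption that overdispersed data violate, motivating the quasi-Poisson and robust-standard-error approaches discussed later in this section.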
# set up 1x3 panel plot
par(mfrow = c(1, 3))
# Poisson distribution for t = 1, and lambda = 2
plot(0 : 10, dpois(0 : 10, lambda = 2), type = "h", frame = FALSE)
# Poisson distribution for t = 1, and lambda = 10
plot(0 : 20, dpois(0 : 20, lambda = 10), type = "h", frame = FALSE)
# Poisson distribution for t = 1, and lambda = 100
plot(0 : 200, dpois(0 : 200, lambda = 100), type = "h", frame = FALSE)

The data were gathered with the rga package, which can be found at http://skardhamar.github.com/rga/.

# load data
load("./data/gaData.rda")
# convert the dates to proper formats
gaData$julian <- julian(gaData$date)
# look at the data
head(gaData)## date visits simplystats julian
## 1 2011-01-01 0 0 14975
## 2 2011-01-02 0 0 14976
## 3 2011-01-03 0 0 14977
## 4 2011-01-04 0 0 14978
## 5 2011-01-05 0 0 14979
## 6 2011-01-06 0 0 14980
# plot visits vs dates
plot(gaData$date,gaData$visits,pch=19,col="darkgrey",xlab="Date",ylab="Visits")
# plot the visits vs dates
plot(gaData$date,gaData$visits,pch=19,col="darkgrey",xlab="Date",ylab="Visits")
# perform linear regression
lm1 <- lm(gaData$visits ~ gaData$date)
# plot regression line
abline(lm1, col="red", lwd=3)
# exponentiate the coefficients of the log-outcome model to interpret them as multiplicative effects
round(exp(coef(lm(I(log(gaData$visits + 1)) ~ gaData$date))), 5)## (Intercept) gaData$date
## 0.00000 1.00231
glm(outcome ~ predictor, family = "poisson") = performs Poisson regression
# plot visits vs dates
plot(gaData$date,gaData$visits,pch=19,col="darkgrey",xlab="Date",ylab="Visits")
# construct Poisson regression model
glm1 <- glm(gaData$visits ~ gaData$date,family="poisson")
# plot linear regression line in red
abline(lm1,col="red",lwd=3)
# plot Poisson regression line in blue (represents mean number of visits per day, which is lambda)
lines(gaData$date,glm1$fitted,col="blue",lwd=3)
# plot residuals vs fitted values
plot(glm1$fitted, glm1$residuals, pch=19, col="grey", ylab="Residuals", xlab="Fitted")

glm(outcome ~ predictor, family = "quasipoisson") = introduces an additional multiplicative dispersion factor \(\phi\) so that the variance is \(\phi \mu\) rather than just \(\mu\) (see Variances and Quasi-Likelihoods)

The sandwich package is one way to calculate robust standard errors.
# load sandwich package
library(sandwich)
# compute model-agnostic confidence intervals using sandwich variance estimates
confint.agnostic <- function (object, parm, level = 0.95, ...)
{
cf <- coef(object); pnames <- names(cf)
if (missing(parm))
parm <- pnames
else if (is.numeric(parm))
parm <- pnames[parm]
a <- (1 - level)/2; a <- c(a, 1 - a)
pct <- stats:::format.perc(a, 3)
fac <- qnorm(a)
ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm,
pct))
ses <- sqrt(diag(sandwich::vcovHC(object)))[parm]
ci[] <- cf[parm] + ses %o% fac
ci
}
# regular confidence interval from Poisson Model
confint(glm1)## 2.5 % 97.5 %
## (Intercept) -34.346577587 -31.159715656
## gaData$date 0.002190043 0.002396461
# model agnostic standard errors
confint.agnostic(glm1)## 2.5 % 97.5 %
## (Intercept) -36.362674594 -29.136997254
## gaData$date 0.002058147 0.002527955
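As an alternative to robust standard errors, a quasi-Poisson fit estimates the dispersion \(\phi\) directly. A sketch using the same data:

```r
# same mean model as glm1, but variance modeled as phi * mu
glmQ <- glm(gaData$visits ~ gaData$date, family = "quasipoisson")
# estimated dispersion phi; values well above 1 indicate overdispersion
summary(glmQ)$dispersion
# point estimates equal the Poisson fit; standard errors scale by sqrt(phi)
cbind(coef(glm1), coef(glmQ))
```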
Use exp(confint(glm1)) to get more interpretable values.
# examine the first 60 days
idx <- 1:60
par(mfrow=c(1,2))
plot(visits ~ date, gaData[idx,],
main='"Zero Inflation" 2011',
xlab="Date", ylab="Visits", pch=21, bg='darkgrey')
lines(gaData$date[idx], glm1$fitted.values[idx], lwd=5, col="red")
points(as.Date("2011-01-10"), 5, cex=12, lwd=5, col="red")
text(as.Date("2011-01-5"), 9, "Zero Inflation", pos=4)
plot(gaData$date, gaData$visits-glm1$fitted.values, pch=21, bg='darkgrey', main="Variance != Mean?", xlab="Date", ylab="Visits over Average")
lines(gaData$date, sqrt(glm1$fitted.values), lwd=5, lty=2, col='red')
lines(gaData$date, -sqrt(glm1$fitted.values), lwd=5, lty=2, col='red')

The simplystats column of our data records the number of visits to the Leek Group site which come from the related site, Simply Statistics (i.e., visits due to clicks on a link to the Leek Group which appeared in a Simply Statistics post).

par(mfrow=c(1,1))
plot(visits ~ date, gaData, pch=21, bg='darkgrey', main="Bursts of Popularity", xlab="Date", ylab="Visits")
points(simplystats ~ date, gaData, pch=19, col='red')
lines(simplystats ~ date, gaData, lwd=1.5, col='darkgreen')
legend('topleft', c("Visits", "Visits from Simply Statistics"), pch=c(21,19),
pt.bg=c("darkgrey", "black"), col=c("black","red"), bg="white")which.max(hits[,'visits']).# maximum number of visits at this date
idx <- which.max(gaData[,"visits"])
gaData[idx,]## date visits simplystats julian
## 704 2012-12-04 94 64 15678
The 95th percentile of a Poisson distribution with mean lambda is given by qpois(.95, lambda).
# expected number of visits (= lambda) at the date with the maximum visits
lambda <- glm1$fitted.values[idx]
# number of visits at the 95th percentile given by our model
qpois(.95, lambda)## [1] 33
Poisson regression can include an offset to model frequencies and proportions. Here, the log number of visits is used as the offset so that the model is interpreted with respect to the number of visits.
glm(outcome ~ predictor, offset = log(offset), family = "poisson") = performs Poisson regression with an offset
glm(outcome ~ predictor + offset(log(offset)), family = "poisson") = produces the same result
Using an offset fixes the coefficient of the offset term to 1, so that the model describes the rate simplystats/visits.
# perform Poisson regression with offset for number of visits
glm2 <- glm(simplystats ~ date, offset=log(visits+1), family="poisson", data=gaData)
# plot the fitted means (from simply statistics) in BLUE
plot(gaData$date, glm2$fitted,col="blue", pch=19,xlab="Date", ylab="Fitted Counts")
# plot the fitted means (total visit) in RED
points(gaData$date, glm1$fitted, col="red", pch=19)
# plot the rates for simply stats (actual data points)
plot(gaData$date, gaData$simplystats/(gaData$visits+1), col="grey",xlab="Date",
ylab="Fitted Rates", pch=19)
# plot the fitted rates for simply stats (visit/day) (fitted model)
lines(gaData$date, glm2$fitted/(gaData$visits+1), col="blue", lwd=3)summary(glm2) will show that the estimated coefficients are significantly different than zero, the model is actually not impressive. We can illustrate why by looking at December 4, 2012, once again. On that day there were 64 actual visits from Simply Statistics. However, according to glm2, 64 visits would be extremely unlikely. You can verify this weakness in the model by finding glm2’s 95th percentile for that day.# number of visits at the 95th percentile given by our model (should be 64)
qpois(.95, glm2$fitted.values[idx])## [1] 47
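As an aside, the equivalence of the two offset specifications noted earlier can be verified directly:

```r
# offset via the offset= argument (as used for glm2 above)
fitA <- glm(simplystats ~ date, offset = log(visits + 1),
            family = "poisson", data = gaData)
# offset() inside the formula: identical coefficients
fitB <- glm(simplystats ~ date + offset(log(visits + 1)),
            family = "poisson", data = gaData)
# by contrast, adding log(visits + 1) as an ordinary regressor
# estimates a free coefficient for it rather than fixing it at 1
fitC <- glm(simplystats ~ date + log(visits + 1),
            family = "poisson", data = gaData)
cbind(coef(fitA), coef(fitB))  # fitA and fitB agree; fitC does not
```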
The pscl package’s zeroinfl function fits zero-inflated Poisson (ZIP) models to such data.

Intro: Extend linear models so that we can fit complicated functions using regression models. We’ve seen a little bit of how to do this by adding squared, cubic, or other polynomial terms into our regression model so that we can fit low-order polynomial functions. However, imagine you have a complicated function like a sine curve, or something that looks like a sine curve. How would you fit that non-parametrically using a linear model?
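Before building spline terms by hand, note that R's splines package provides ready-made natural cubic spline bases. A minimal self-contained sketch on the same kind of simulated data used in the manual construction that follows:

```r
library(splines)
# simulated noisy sine curve
n <- 500; x <- seq(0, 4 * pi, length = n); y <- sin(x) + rnorm(n, sd = .3)
# natural cubic spline basis; df controls the number of knots
fit <- lm(y ~ ns(x, df = 10))
plot(x, y, frame = FALSE, pch = 21, bg = "lightblue")
lines(x, predict(fit), col = "blue", lwd = 2)
```

The hand-rolled versions below make explicit what a basis like ns() constructs internally.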
# simulate data: sin curve + noise
n <- 500; x <- seq(0, 4 * pi, length = n); y <- sin(x) + rnorm(n, sd = .3)
# define 20 knot points (those beyond the range of x, above 4*pi, contribute nothing to the fit)
knots <- seq(0, 8 * pi, length = 20)
# define the ()+ function to only take the values that are positive after the knot pt
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot))
# define the predictors as X and spline term
xMat <- cbind(x, splineTerms)
# fit linear models for y vs predictors
yhat <- predict(lm(y ~ xMat))
# plot data points (x, y)
plot(x, y, frame = FALSE, pch = 21, bg = "lightblue")
# plot fitted values
lines(x, yhat, col = "red", lwd = 2)

Squaring the spline terms (and adding an x^2 term) makes the fit continuous in its first derivative at the knots, giving a smoother curve.

# define the knot terms in the model
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot)^2)
# define the predictors as x, x^2 and knot terms
xMat <- cbind(x, x^2, splineTerms)
# fit linear models for y vs predictors
yhat <- predict(lm(y ~ xMat))
# plot data points (x, y)
plot(x, y, frame = FALSE, pch = 21, bg = "lightblue")
# plot fitted values
lines(x, yhat, col = "red", lwd = 2)

Basis functions can also come from domain knowledge; here, sinusoids at the piano-key frequencies are used to discover which notes make up a chord.

# frequencies for white keys from c4 to c5
notes4 <- c(261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25)
# generate sequence for 2 seconds
t <- seq(0, 2, by = .001); n <- length(t)
# define data for c4 e4 g4 using sine waves with their frequencies
c4 <- sin(2 * pi * notes4[1] * t); e4 <- sin(2 * pi * notes4[3] * t);
g4 <- sin(2 * pi * notes4[5] * t)
# define data for a chord and add a bit of noise
chord <- c4 + e4 + g4 + rnorm(n, 0, 0.3)
# generate profile data for all notes
x <- sapply(notes4, function(freq) sin(2 * pi * freq * t))
# fit the chord using the profiles for all notes
fit <- lm(chord ~ x - 1)
# set up plot
plot(c(0, 9), c(0, 1.5), xlab = "Note", ylab = "Coef^2", axes = FALSE, frame = TRUE, type = "n")
# set up axes
axis(2)
axis(1, at = 1 : 8, labels = c("c4", "d4", "e4", "f4", "g4", "a4", "b4", "c5"))
# add vertical lines for each note
for (i in 1 : 8) abline(v = i, lwd = 3, col = grey(.8))
# plot the linear regression fits
lines(c(0, 1 : 8, 9), c(0, coef(fit)^2, 0), type = "l", lwd = 3, col = "red")

fft(data) = performs the fast Fourier transform on the provided data
Re(data) = extracts only the real components of the complex data

# perform the fast Fourier transform on the chord vector
a <- fft(chord)
# plot only the real components/frequencies of the fft
plot(Re(a)^2, type = "l")
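Since the chord was sampled every millisecond (a 1000 Hz sampling rate), each FFT bin corresponds to a known frequency, and the power at the bins nearest the white-key frequencies picks out the notes in the chord. A sketch, assuming the chord and notes4 objects defined above:

```r
# power spectrum via the squared modulus of the FFT
power <- Mod(fft(chord))^2
# frequency (Hz) of each bin: (k - 1) * sampling rate / number of samples
freq <- (seq_along(power) - 1) * 1000 / length(power)
# power at the bin nearest each white-key frequency;
# the entries for c4, e4, and g4 stand far above the rest
sapply(notes4, function(f) power[which.min(abs(freq - f))])
```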